The purpose of this analysis is to estimate the growth rates of bacterial species in the cecal microbiomes of turkey’s fed various amounts of BMD (bacitracin methylene disalicylate)
Taken from “Measurement of bacterial replication rates in microbial communities”
Christopher T Brown, Matthew R Olm, Brian C Thomas & Jillian F Banfield
Nature Biotechnology volume34, pages1256–1263 (2016)
Dividing cells in a natural population contain, on average, more than one copy of their genome. In an unsynchronized population of growing bacteria, cells contain genomes that are replicated to different extents, resulting in a gradual reduction in the average genome copy number from the origin to the terminus of replication. This decrease can be detected by measuring changes in DNA sequencing coverage across complete genomes. Bacterial genome replication proceeds bi-directionally from a single origin of replication, therefore the origin and terminus of replication can be deduced based on this coverage pattern. GC skew and genome coverage analyses of a wide variety of bacteria have shown that this replication mechanism is broadly applicable. Further, early studies of bacterial cultures revealed that cells can achieve faster division by simultaneously initiating multiple rounds of genome replication, which results in an average of more than two genome copies in rapidly growing cells.
iRep_explain
(a) Populations of bacteria undergoing rapid cell division differ from slowly growing populations in that the individual cells of a growing population are more actively in the process of replicating their genomes (purple circles). (b) Differences in genome copy number across a population of replicating cells can be determined based on sequencing read coverage over complete genome sequences. The ratio between the coverage at the origin (“peak”) and terminus (“trough”) of replication (PTR) relates to the replication rate of the population. The origin and terminus can be determined based on cumulative GC skew. (c,d) If no complete genome sequence is available, it is possible to calculate the replication rate based on the distribution of coverage values across a draft-quality genome using the iRep method. Coverage is first calculated across overlapping segments of genome fragments. Growing populations will have a wider distribution of coverage values compared with stable populations (histograms). These values are ordered from lowest to highest, and linear regression is used to evaluate the coverage distribution across the genome in order to determine the coverage values associated with the origin and terminus of replication. iRep is calculated as the ratio of these values. (e) Genome-resolved metagenomics involves DNA extraction from a microbiome sample followed by DNA sequencing, assembly, and genome binning. Binning is the grouping together of assembled genome fragments that originated from the same genome. This can be done based on shared characteristics of each fragment, such as sequence composition, taxonomic affiliation, or abundance.
exp_design
HiSeq 2x150 PE reads
1 library per bird = 90 libraries
2,480,187,759 paired reads
avg insert = 280 bp
PacBio RSII reads
library preps from pooled samples
1 library prep per timepoint
equal pool of all treatments
3 PacBio RSII library preps
3 SMRT cells?
7,178,582 total reads Not as long as I was expecting
#!/bin/bash
#SBATCH --job-name="ALL_QC" # name of the job submitted
#SBATCH -p mem # name of the queue you are submitting to
#SBATCH -n 40 # number of cores/tasks in this job, you get all 20 cores with 2 threads per core with hyperthreading
#SBATCH -N 1 # nodes
#SBATCH --mem=1000G # memory allocation
#SBATCH -t 72:00:00 # time allocated for this job hours:mins:seconds
#SBATCH --mail-user=julestrachsel@gmail.com # will receive an email when job starts, ends or fails
#SBATCH --mail-type=BEGIN,END,FAIL # will receive an email when job starts, ends or fails
#SBATCH -o "stdout.%j.%N" # standard out %j adds job number to outputfile name and %N adds the node name
#SBATCH -e "stderr.%j.%N" # optional but it prints our standard error
# ENTER COMMANDS HERE:
module load java/1.8.0_121
set -e # I think this means if anything fails it immediately exits and doesnt keep plowing ahead.
#Written by Brian Bushnell
#Last updated March 4, 2019
#This script is designed to preprocess data for assembly of overlapping 2x150bp reads from Illumina HiSeq 2500.
#Some numbers and steps may need adjustment for different data types or file paths.
#For large genomes, tadpole and bbmerge (during the "Merge" phase) may need the flag "prefilter=2" to avoid running out of memory.
#"prefilter" makes these take twice as long though so don't use it if you have enough memory.
#The "rm ALL_temp.fq.gz; ln -s reads.fq.gz ALL_temp.fq.gz" is not necessary but added so that any pipeline stage can be easily disabled,
#without affecting the input file name of the next stage.
#interleave reads
#reformat.sh in=ALL_R#.fastq.gz out=ALL.fq.gz
# 92 fecal shotgun metagenomic libraries
# sequenced on HiSeq3000 2x150 PE
# I think avg insert sizes is ~280?
# this takes the R1 and R2 files and creates interleaved fastq.gz files for each
for x in *_R1.fastq.gz
do
#echo "${x%_R1*}_R2.fastq.gz"
reformat.sh in1="$x" in2="${x%_R1*}_R2.fastq.gz" out="${x%_R1*}.fq.gz"
done
# combines all interleaved reads into one file
cat *.fq.gz > ALL.fq.gz
#Link the interleaved input file as "ALL_temp.fq.gz"
ln -s ALL.fq.gz ALL_temp.fq.gz
# --- Preprocessing ---
#Remove optical duplicates
clumpify.sh in=ALL_temp.fq.gz out=ALL.clumped.fq.gz dedupe optical
rm ALL_temp.fq.gz; ln -s ALL.clumped.fq.gz ALL_temp.fq.gz
#Remove low-quality regions
filterbytile.sh in=ALL_temp.fq.gz out=ALL.filtered_by_tile.fq.gz
rm ALL_temp.fq.gz; ln -s ALL.filtered_by_tile.fq.gz ALL_temp.fq.gz
#Trim adapters. Optionally, reads with Ns can be discarded by adding "maxns=0" and reads with really low average quality can be discarded with "maq=8".
bbduk.sh in=ALL_temp.fq.gz out=ALL.trimmed.fq.gz ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=70 ref=adapters ftm=5 ordered
rm ALL_temp.fq.gz; ln -s ALL.trimmed.fq.gz ALL_temp.fq.gz
#Remove synthetic artifacts and spike-ins by kmer-matching.
bbduk.sh in=ALL_temp.fq.gz out=ALL.filtered.fq.gz k=31 ref=artifacts,phix ordered cardinality
rm ALL_temp.fq.gz; ln -s ALL.filtered.fq.gz ALL_temp.fq.gz
#Decontamination by mapping can be done here.
#JGI removes these in two phases:
#1) common microbial contaminants (E.coli, Pseudomonas, Delftia, others)
#2) common animal contaminants (Human, cat, dog, mouse)
bbsplit.sh in=SAMPLE_temp.fq.gz ref=turkey.fa basename=SAMPLE_out_%.fq.gz outu=SAMPLE_clean.fq.gz int=t
rm SAMPLE_temp.fq.gz; ln -s SAMPLE_clean.fq.gz SAMPLE_temp.fq.gz
#Error-correct phase 1
bbmerge.sh in=ALL_temp.fq.gz out=ALL.ecco.fq.gz ecco mix vstrict ordered ihist=ihist_merge1.txt
rm ALL_temp.fq.gz; ln -s ALL.ecco.fq.gz ALL_temp.fq.gz
#Error-correct phase 2
clumpify.sh in=ALL_temp.fq.gz out=ALL.eccc.fq.gz ecc passes=4 reorder
rm ALL_temp.fq.gz; ln -s ALL.eccc.fq.gz ALL_temp.fq.gz
#Error-correct phase 3
#Low-depth reads can be discarded here with the "tossjunk", "tossdepth", or "tossuncorrectable" flags.
#For very large datasets, "prefilter=1" or "prefilter=2" can be added to conserve memory.
#Alternatively, bbcms.sh can be used if Tadpole still runs out of memory.
tadpole.sh in=ALL_temp.fq.gz out=ALL.ecct.fq.gz ecc k=62 ordered tossjunk prefilter=1
rm ALL_temp.fq.gz; ln -s ALL.ecct.fq.gz ALL_temp.fq.gz
#Normalize
#This phase can be very beneficial for data with uneven coverage like metagenomes, MDA-amplified single cells, and RNA-seq, but is not usually recommended for isolate DNA.
#So normally, this stage should be commented out, as it is here.
bbnorm.sh in=ALL_temp.fq.gz out=normalized.fq.gz target=100 hist=khist.txt peaks=peaks.txt
rm ALL_temp.fq.gz; ln -s normalized.fq.gz ALL_temp.fq.gz
This ran for about 50 hours.
#!/bin/bash
#SBATCH --job-name="metaspades" # name of the job submitted
#SBATCH -p mem # name of the queue you are submitting to
#SBATCH -n 100 # number of cores/tasks in this job, you get all 20 cores with 2 threads per core with hyperthreading
#SBATCH -N 1 # number of nodes
#SBATCH --mem=1400G # memory allocation
#SBATCH -t 167:00:00 # time allocated for this job hours:mins:seconds
#SBATCH --mail-user=YOUREMAILHERE@email.com # will receive an email when job starts, ends or fails
#SBATCH --mail-type=BEGIN,END,FAIL # will receive an email when job starts, ends or fails
#SBATCH -o "stdout.%j.%N" # standard out %j adds job number to outputfile name and %N adds the node name
#SBATCH -e "stderr.%j.%N" # optional but it prints our standard error
module load java/1.8.0_121
# convert from interleaved to R1 R2 files
reformat.sh in=ALL_normalized.fq.gz out1=ALL_normalized_1.fq.gz out2=ALL_normalized_2.fq.gz
metaspades.py -o run1 -1 ALL_normalized_1.fq.gz -2 ALL_normalized_2.fq.gz --pacbio ALL_pacbio.fq.gz --only-assembler -m 1400 --tmp-dir "$TMPDIR" -t 100 -k 25,55,95,125| attribute | Megahit | Metaspades |
|---|---|---|
| Main genome scaffold total: | 1,486,743 | 1,370,062 |
| Main genome contig total: | 1,486,743 | 1,390,561 |
| Main genome scaffold sequence total: | 3849 MB | 3637 MB |
| Main genome contig sequence total: | 3849 MB | 3636 MB |
| Main genome scaffold N/L50: | 114231 / 6624 | 103354 / 6047 |
| Main genome contig N/L50: | 114231 / 6624 | 109063 / 5812 |
| Main genome scaffold N/L90: | 152813 / 5065 | 172075 / 3856 |
| Main genome contig N/L90: | 152813 / 5065 | 180267 / 3747 |
| Main genome assembly score: | 18.019 | 19.793 |
| Max scaffold length: | 1.063 MB | 1.137 MB |
| Max contig length: | 1.063 MB | 1.063 MB |
| Number of scaffolds > 50 KB: | 3,899 | 5,000 |
| Number of bases in scaffolds>50 KB: | 347 MB | 482 MB |
| % main genome in scaffolds > 50 KB: | 9.04% | 13.26% |
| min scaffold length | Megahit | Metaspades | Megahit | Metaspades |
|---|---|---|---|---|
| 500 | 1,486,743 | 1,370,062 | 3849 MB | 3637 MB |
| 1000 | 723,321 | 744,575 | 3325 MB | 3197 MB |
| 2500 | 306,489 | 281,246 | 2686 MB | 2483 MB |
| 5000 | 154,925 | 128,562 | 2158 MB | 1956 MB |
| 10000 | 69,333 | 56,088 | 1560 MB | 1454 MB |
| 25000 | 15,980 | 15,669 | 754 MB | 845 MB |
| 50000 | 3,899 | 5,000 | 347 MB | 482 MB |
| 100000 | 942 | 1,427 | 151 MB | 240 MB |
| 250000 | 76 | 162 | 27 MB | 58 MB |
| 500000 | 8 | 18 | 5.5 MB | 12.9 MB |
| 1000000 | 1 | 3 | 1.1 MB | 3.3 MB |
#!/bin/bash
#SBATCH --job-name="SAMPLE" # name of the job submitted
#SBATCH -p brief-low # name of the queue you are submitting to
#SBATCH -n 30 # number of cores/tasks in this job, you get all 20 cores with 2 threads per core with hyperthreading
#SBATCH -N 1 # number of nodes
#SBATCH --mem=300G # memory allocation
#SBATCH -t 2:00:00 # time allocated for this job hours:mins:seconds
#SBATCH --mail-user=YOUREMAILHERE@email.com # will receive an email when job starts, ends or fails
#SBATCH --mail-type=BEGIN,END,FAIL # will receive an email when job starts, ends or fails
#SBATCH -o "stdout.%j.%N" # standard out %j adds job number to outputfile name and %N adds the node name
#SBATCH -e "stderr.%j.%N" # optional but it prints our standard error
# ENTER COMMANDS HERE:
module load java/1.8.0_121
set -e # I think this means if anything fails it immediately exits and doesnt keep plowing ahead.
#Written by Brian Bushnell
#Last updated March 4, 2019
#This script is designed to preprocess data for assembly of overlapping 2x150bp reads from Illumina HiSeq 2500.
#Some numbers and steps may need adjustment for different data types or file paths.
#For large genomes, tadpole and bbmerge (during the "Merge" phase) may need the flag "prefilter=2" to avoid running out of memory.
#"prefilter" makes these take twice as long though so don't use it if you have enough memory.
#The "rm SAMPLE_temp.fq.gz; ln -s reads.fq.gz SAMPLE_temp.fq.gz" is not necessary but added so that any pipeline stage can be easily disabled,
#without affecting the input file name of the next stage.
#interleave reads
reformat.sh in=SAMPLE_R#.fastq.gz out=SAMPLE.fq.gz
#Link the interleaved input file as "SAMPLE_temp.fq.gz"
ln -s SAMPLE.fq.gz SAMPLE_temp.fq.gz
# --- Preprocessing ---
#Remove optical duplicates
clumpify.sh in=SAMPLE_temp.fq.gz out=SAMPLE.clumped.fq.gz dedupe optical
rm SAMPLE_temp.fq.gz; ln -s SAMPLE.clumped.fq.gz SAMPLE_temp.fq.gz
#Remove low-quality regions
filterbytile.sh in=SAMPLE_temp.fq.gz out=SAMPLE.filtered_by_tile.fq.gz
rm SAMPLE_temp.fq.gz; ln -s SAMPLE.filtered_by_tile.fq.gz SAMPLE_temp.fq.gz
#Trim adapters. Optionally, reads with Ns can be discarded by adding "maxns=0" and reads with really low average quality can be discarded with "maq=8".
bbduk.sh in=SAMPLE_temp.fq.gz out=SAMPLE.trimmed.fq.gz ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=70 ref=adapters ftm=5 ordered
rm SAMPLE_temp.fq.gz; ln -s SAMPLE.trimmed.fq.gz SAMPLE_temp.fq.gz
#Remove synthetic artifacts and spike-ins by kmer-matching.
bbduk.sh in=SAMPLE_temp.fq.gz out=SAMPLE.filtered.fq.gz k=31 ref=artifacts,phix ordered cardinality
rm SAMPLE_temp.fq.gz; ln -s SAMPLE.filtered.fq.gz SAMPLE_temp.fq.gz
# create a list of all samples:
for x in *R1*gz;
do
echo "${x%_S*}" >> samples.txt
done
# generate a SLURM script for each sample
while read line
do
cat pre_mapping_QC.SLURM | sed "s/SAMPLE/$line/g" > "$line"_tmp.SLURM
done < samples.txt
# After you confirm they look good, submit with:
for x in *_tmp.SLURM
do
sbatch $x
done #!/bin/bash
#SBATCH --job-name="SAMPLE" # name of the job submitted
#SBATCH -p mem-low # name of the queue you are submitting to
#SBATCH -n 40 # number of cores/tasks in this job, you get all 20 cores with 2 threads per core with hyperthreading
#SBATCH -N 1 # number of nodes
#SBATCH --mem=200G # memory allocation
#SBATCH -t 2:00:00 # time allocated for this job hours:mins:seconds
#SBATCH --mail-user=julestrachsel@gmail.com # will receive an email when job starts, ends or fails
#SBATCH --mail-type=BEGIN,END,FAIL # will receive an email when job starts, ends or fails
#SBATCH -o "stdout.%j.%N" # standard out %j adds job number to outputfile name and %N adds the node name
#SBATCH -e "stderr.%j.%N" # optional but it prints our standard error
# ENTER COMMANDS HERE:
module load java/1.8.0_121
module load samtools
module load pigz
set -e # I think this means if anything fails it immediately exits and doesnt keep plowing ahead.
# FIRST GENERATING AN INDEXED REFERENCE WITH
# bbmap.sh ref=metaspades_500_scaff.fa
# then all jobs will just use this pre-indexed reference. I am doing this with an interactive shell prior to submitting the SLURMS
# not specifying reference because of preindexing listed above.
bbmap.sh in=../hiseq/SAMPLE.filtered.fq.gz outm=SAMPLE_mapped.sam.gz bs=SAMPLE.bamscript qtrim=10 untrim ambig=all pigz unpigz
# this bamscript will generate sorted & indexed bam files
chmod u+x SAMPLE.bamscript
./SAMPLE.bamscriptAgain, this script is just a template, I used it to create a SLURM script for each sample using a loop like the one previously mentioned.
This script will output sams as well as indexed bams for each sample mapping to the co-assembly.
#!/bin/bash
#SBATCH --job-name="metabat" # name of the job submitted
#SBATCH -p mem # name of the queue you are submitting to
#SBATCH -n 40 # number of cores/tasks in this job, you get all 20 cores with 2 threads per core with hyperthreading
#SBATCH -N 1 # number of nodes
#SBATCH --mem=125G # memory allocation
#SBATCH -t 47:00:00 # time allocated for this job hours:mins:seconds
#SBATCH --mail-user=YOUREMAIL@email.com # will receive an email when job starts, ends or fails
#SBATCH --mail-type=BEGIN,END,FAIL # will receive an email when job starts, ends or fails
#SBATCH -o "stdout.%j.%N" # standard out %j adds job number to outputfile name and %N adds the node name
#SBATCH -e "stderr.%j.%N" # optional but it prints our standard error
# ENTER COMMANDS HERE:
module load metabat
module load checkm/v1.0.11
jgi_summarize_bam_contig_depths --outputDepth depth.txt --pairedContigs paired.txt --minContigLength 1000 --minContigDepth 2 *.bam
metabat -i metaspades_500_scaff.fa -a depth.txt -o run1/bin -v --saveTNF saved.tnf --saveDistance saved.dist
checkm lineage_wf -f run1/CheckM.txt -t 32 -x fa run1/ run1/SCGremove contigs shorter than 5kb
This is the python code I used to remove contigs shorter than 5kb from each bin as well as remove the bins with more than 200 contigs.
from Bio import SeqIO
import os
files = os.listdir()
for fasta in files:
new_genome = []
genome = SeqIO.parse(fasta, 'fasta')
for seq in genome:
if len(seq.seq) > 4999:
new_genome.append(seq)
if 0 < len(new_genome) < 200:
new_genome_handle = fasta.strip('.fa') + '_new.fasta'
SeqIO.write(new_genome, new_genome_handle, 'fasta')
else:
pass#!/bin/bash
#SBATCH --job-name="checkm" # name of the job submitted
#SBATCH -p short # name of the queue you are submitting to
#SBATCH -n 16
#SBATCH -N 1
#SBATCH --mem=48G # number of cores/tasks in this job, you get all 20 cores with 2 threads per core with hyperthreading
#SBATCH -t 47:00:00 # time allocated for this job hours:mins:seconds
#SBATCH --mail-user=julestrachsel@gmail.com # will receive an email when job starts, ends or fails
#SBATCH --mail-type=BEGIN,END,FAIL # will receive an email when job starts, ends or fails
#SBATCH -o "stdout.%j.%N" # standard out %j adds job number to outputfile name and %N adds the node name
#SBATCH -e "stderr.%j.%N" # optional but it prints our standard error
# ENTER COMMANDS HERE:
module load checkm
checkm lineage_wf -f out/CheckM.txt -t 16 -x fasta ./ ./out#!/bin/bash
#SBATCH --job-name="SAMPLE" # name of the job submitted
#SBATCH -p short # name of the queue you are submitting to
#SBATCH -n 8 # number of cores/tasks
#SBATCH -N 1
#SBATCH --mem=100G # memory allocation
#SBATCH -t 47:00:00 # time allocated for this job hours:mins:seconds
#SBATCH --mail-user=YOUREMAIL@email.com # will receive an email when job fails
#SBATCH --mail-type=FAIL # will receive an email when job fails
#SBATCH -o "stdout.%j.%N" # standard out %j adds job number to outputfile name and %N adds the node name
#SBATCH -e "stderr.%j.%N" # optional but it prints our standard error
# ENTER COMMANDS HERE:
# activate the conda environment
source ~/iREP/bin/activate
gunzip SAMPLE_mapped.sam.gz
# allows iRep to tolerate 7 mismatches, over ~150bp reads this is about 95%
iRep -f *.fa -s SAMPLE_mapped.sam -o SAMPLE.out -mm 7 -t 8 --no-plot --sort#!/bin/bash
#SBATCH --job-name="7_compile" # name of the job submitted
#SBATCH -p short # name of the queue you are submitting to
#SBATCH -n 8 # number of cores/tasks in this job, you get all 20 cores with 2 threads per core with hyperthreading
#SBATCH -N 1 # nodes
#SBATCH --mem=32G # memory allocation
#SBATCH -t 47:00:00 # time allocated for this job hours:mins:seconds
#SBATCH --mail-user=julestrachsel@gmail.com # will receive an email when job fails
#SBATCH --mail-type=FAIL # will receive an email when job fails
#SBATCH -o "stdout.%j.%N" # standard out %j adds job number to outputfile name and %N adds the node name
#SBATCH -e "stderr.%j.%N" # optional but it prints our standard error
# ENTER COMMANDS HERE:
source ~/iREP/bin/activate
iRep_filter.py -c 3 -w 93 -t *tsv --long > ALL_RESULTS_longlibrary(tidyverse)
library(broom)
library(lme4)
library(lmerTest)
# CheckM bin info #
colnams <- c('bin', 'marker_lineage', 'num_genomes', 'num_markers', 'num_marker_sets', 'x0', 'x1', 'x2', 'x3','x4','x5+', 'Completeness', 'Contamination', 'Strain_heterogeneity')
checkm <- read_delim('CheckM_clean3.txt', delim = '\t', trim_ws = TRUE, skip = 2, col_names = colnams) %>%
mutate(bin=sub('_new', '', bin))
# read in the data and extract some metadata from sample names
dat <- read_tsv('ALL_RESULTS_BULK_MAP.txt', skip = 1, col_types = c('iccnnnnnnninn'))%>%
mutate(sample = sub('_mapped.sam','',sample),
genome = sub('.fa','',genome),
day = sub('d([0-9][0-9])([a-z]+)([0-9][0-9][0-9])','\\1',sample),
treatment = sub('d([0-9][0-9])([a-z]+)([0-9][0-9][0-9])','\\2',sample),
bird = sub('d([0-9][0-9])([a-z]+)([0-9][0-9][0-9])','\\3',sample),
day_num = as.numeric(day)) %>%
filter(!(is.na(iRep))) %>%
filter(!grepl('ex', sample)) %>%
select(-X1, -`fragments/Mbp`)## Warning: Missing column names filled in: 'X1' [1]
## Warning: 52298 parsing failures.
## row col expected actual file
## 1 iRep a number n/a 'ALL_RESULTS_BULK_MAP.txt'
## 1 fragments/Mbp no trailing characters .0 'ALL_RESULTS_BULK_MAP.txt'
## 2 iRep a number n/a 'ALL_RESULTS_BULK_MAP.txt'
## 2 fragments/Mbp no trailing characters .0 'ALL_RESULTS_BULK_MAP.txt'
## 3 iRep a number n/a 'ALL_RESULTS_BULK_MAP.txt'
## ... ............. ...................... ...... ..........................
## See problems(...) for more details.
## Warning: NAs introduced by coercion
dat <- dat %>% filter(coverage > 5) # only keep estimates from samples where bins had > 5x coverage (as reccommended)
# Histogram of all iRep estimates
dat %>% ggplot(aes(x=iRep)) +
geom_histogram(bins = 50) +
theme_bw() + ggtitle('Histogram of all valid iRep estimates')# some bins have extremely high coverage / relative abundance
dat %>% ggplot(aes(x=(coverage))) + geom_histogram(bins = 100) +
ggtitle('Histogram of bin coverages') +
theme_bw()dat %>%
ggplot(aes(x=log2(coverage))) +
geom_histogram(bins = 100) +
ggtitle('Histogram of log(coverage)') +
theme_bw()dat %>%
ggplot(aes(x=`relative abundance`)) +
geom_histogram(bins = 100) +
ggtitle('Histogram of relative_abundance') +
theme_bw()dat %>%
ggplot(aes(x=(log2(`relative abundance`)))) +
geom_histogram(bins = 100) +
ggtitle('Histogram of log(relative_abundance)') +
theme_bw()#### iRep is significantly negatively correlated with both coverage and relative abundance
cor.test(x = dat$iRep, log(dat$`relative abundance`))##
## Pearson's product-moment correlation
##
## data: dat$iRep and log(dat$`relative abundance`)
## t = -8.492, df = 2276, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2147699 -0.1351505
## sample estimates:
## cor
## -0.1752467
dat %>% ggplot(aes(x=log(`relative abundance`), y=iRep)) +
geom_point() +
geom_smooth(method = 'lm') +
theme_bw()# what about testing this within each bin?
valid_irep_totbin <- dat %>%
group_by(genome) %>%
tally() %>%
arrange(desc(n))
valid_irep_totbin %>%
ggplot(aes(x=n)) +
geom_histogram(bins = 50) +
xlab('number of valid iRep estimates')+
ggtitle('number of valid iRep estimates per bin') +
theme_bw()## [1] 2278
## [1] 244
# from 244 genomes
# How do these estimates look by treatment? by day?
# boxplots
dat %>%
ggplot(aes(x=treatment, y=iRep)) + geom_boxplot() + facet_wrap(~day) +
theme_bw()### neat, maybe a treatment effect?
### maybe a time effect?
### But, we dont know if we are measuring the same genomes growth rates in each of these categories...
##### main point here is that the data are sparse #######
# calculating which bins have enough datapoints available for meaningful stats
number_of_obs <- dat %>%
group_by(day, treatment, genome) %>%
tally() %>% spread(key = treatment, value=n) %>%
ungroup() %>% group_by(day, genome) %>%
mutate(num_NA = sum(is.na(c(ctrl, sub, ther))),
num_w_4p = sum(c(ctrl, sub, ther) >3, na.rm = TRUE)) %>%
ungroup() %>%
unite(col = 'bin_day', genome, day, remove = FALSE)
number_of_obs## # A tibble: 449 x 8
## bin_day day genome ctrl sub ther num_NA num_w_4p
## <chr> <chr> <chr> <int> <int> <int> <int> <int>
## 1 bin.1004_07 07 bin.1004 NA NA 2 2 0
## 2 bin.1007_07 07 bin.1007 NA NA 1 2 0
## 3 bin.1011_07 07 bin.1011 1 NA NA 2 0
## 4 bin.1012_07 07 bin.1012 2 NA 6 1 1
## 5 bin.1016_07 07 bin.1016 2 2 3 0 0
## 6 bin.102_07 07 bin.102 1 NA 3 1 0
## 7 bin.103_07 07 bin.103 3 8 4 0 2
## 8 bin.1031_07 07 bin.1031 4 7 5 0 3
## 9 bin.1042_07 07 bin.1042 1 NA NA 2 0
## 10 bin.1044_07 07 bin.1044 9 NA NA 2 1
## # … with 439 more rows
# data grouped by bin and day to tally numobs per group
# These have at least 1 treatment group with 4+ observations at the given timepoint
one_group <- number_of_obs %>% filter(num_w_4p >= 1)
one_group## # A tibble: 138 x 8
## bin_day day genome ctrl sub ther num_NA num_w_4p
## <chr> <chr> <chr> <int> <int> <int> <int> <int>
## 1 bin.1012_07 07 bin.1012 2 NA 6 1 1
## 2 bin.103_07 07 bin.103 3 8 4 0 2
## 3 bin.1031_07 07 bin.1031 4 7 5 0 3
## 4 bin.1044_07 07 bin.1044 9 NA NA 2 1
## 5 bin.110_07 07 bin.110 4 3 3 0 1
## 6 bin.127_07 07 bin.127 4 5 2 0 2
## 7 bin.175_07 07 bin.175 3 5 7 0 2
## 8 bin.205_07 07 bin.205 3 6 6 0 2
## 9 bin.209_07 07 bin.209 2 4 3 0 1
## 10 bin.231_07 07 bin.231 6 4 5 0 3
## # … with 128 more rows
#### These have at least 2 groups with 4+ observations
two_groups <- number_of_obs %>% filter(num_w_4p > 1)
two_groups## # A tibble: 70 x 8
## bin_day day genome ctrl sub ther num_NA num_w_4p
## <chr> <chr> <chr> <int> <int> <int> <int> <int>
## 1 bin.103_07 07 bin.103 3 8 4 0 2
## 2 bin.1031_07 07 bin.1031 4 7 5 0 3
## 3 bin.127_07 07 bin.127 4 5 2 0 2
## 4 bin.175_07 07 bin.175 3 5 7 0 2
## 5 bin.205_07 07 bin.205 3 6 6 0 2
## 6 bin.231_07 07 bin.231 6 4 5 0 3
## 7 bin.235_07 07 bin.235 8 10 8 0 3
## 8 bin.280_07 07 bin.280 5 8 7 0 3
## 9 bin.313_07 07 bin.313 3 8 5 0 2
## 10 bin.316_07 07 bin.316 7 5 7 0 3
## # … with 60 more rows
# these have all 3 groups represented by at least 4 data points each
all_groups <- number_of_obs %>% filter(num_w_4p == 3)
all_groups## # A tibble: 27 x 8
## bin_day day genome ctrl sub ther num_NA num_w_4p
## <chr> <chr> <chr> <int> <int> <int> <int> <int>
## 1 bin.1031_07 07 bin.1031 4 7 5 0 3
## 2 bin.231_07 07 bin.231 6 4 5 0 3
## 3 bin.235_07 07 bin.235 8 10 8 0 3
## 4 bin.280_07 07 bin.280 5 8 7 0 3
## 5 bin.316_07 07 bin.316 7 5 7 0 3
## 6 bin.433_07 07 bin.433 5 5 6 0 3
## 7 bin.493_07 07 bin.493 5 4 5 0 3
## 8 bin.886_07 07 bin.886 6 8 9 0 3
## 9 bin.205_35 35 bin.205 4 4 5 0 3
## 10 bin.235_35 35 bin.235 9 4 5 0 3
## # … with 17 more rows
# no bin has 4 or more observations in every group at every timepoint
number_of_obs %>% select(genome, day, num_w_4p) %>%
spread(key = day, value = num_w_4p) %>%
mutate(all_times = `07`+ `35`+ `78`) %>% arrange(desc(all_times)) %>%
mutate(d7g = ifelse(`07` == 3, TRUE, FALSE),
d35g = ifelse(`35` == 3, TRUE, FALSE),
d78g = ifelse(`78` == 3, TRUE, FALSE))## # A tibble: 244 x 8
## genome `07` `35` `78` all_times d7g d35g d78g
## <chr> <int> <int> <int> <int> <lgl> <lgl> <lgl>
## 1 bin.316 3 3 2 8 TRUE TRUE FALSE
## 2 bin.280 3 3 1 7 TRUE TRUE FALSE
## 3 bin.716 1 3 3 7 FALSE TRUE TRUE
## 4 bin.235 3 3 0 6 TRUE TRUE FALSE
## 5 bin.796 1 3 2 6 FALSE TRUE FALSE
## 6 bin.205 2 3 0 5 FALSE TRUE FALSE
## 7 bin.306 0 2 3 5 FALSE FALSE TRUE
## 8 bin.735 2 2 1 5 FALSE FALSE FALSE
## 9 bin.886 3 1 1 5 TRUE FALSE FALSE
## 10 bin.988 1 3 1 5 FALSE TRUE FALSE
## # … with 234 more rows
# kindof makes sense, both ABX treatment and time affected the community composition
#
# Will probably have to stick to comparing iRep estimates in two main ways:
# 1) Within timepoint between treatments
# 2) within a treatment between timepoints
## one timepoint ##
# 8 bins have all treatment groups represented with 4+ observations at D7
D7 <- all_groups %>%
select(genome, day, num_w_4p) %>%
spread(key=day, value = num_w_4p) %>%
filter(!is.na(`07`))
D7## # A tibble: 8 x 4
## genome `07` `35` `78`
## <chr> <int> <int> <int>
## 1 bin.1031 3 NA NA
## 2 bin.231 3 NA NA
## 3 bin.235 3 3 NA
## 4 bin.280 3 3 NA
## 5 bin.316 3 3 NA
## 6 bin.433 3 NA NA
## 7 bin.493 3 3 NA
## 8 bin.886 3 NA NA
# 11 bins have all treatment groups represented with 4+ observations at D35
D35 <- all_groups %>%
select(genome, day, num_w_4p) %>%
spread(key=day, value = num_w_4p) %>%
filter(!is.na(`35`))
D35## # A tibble: 11 x 4
## genome `07` `35` `78`
## <chr> <int> <int> <int>
## 1 bin.205 NA 3 NA
## 2 bin.235 3 3 NA
## 3 bin.280 3 3 NA
## 4 bin.316 3 3 NA
## 5 bin.493 3 3 NA
## 6 bin.550 NA 3 NA
## 7 bin.642 NA 3 NA
## 8 bin.716 NA 3 3
## 9 bin.796 NA 3 NA
## 10 bin.826 NA 3 NA
## 11 bin.988 NA 3 NA
# 8 bins have all treatment groups represented with 4+ observations at D78
D78 <- all_groups %>%
select(genome, day, num_w_4p) %>%
spread(key=day, value = num_w_4p) %>%
filter(!is.na(`78`))
D78## # A tibble: 8 x 4
## genome `07` `35` `78`
## <chr> <int> <int> <int>
## 1 bin.306 NA NA 3
## 2 bin.389 NA NA 3
## 3 bin.55 NA NA 3
## 4 bin.623 NA NA 3
## 5 bin.693 NA NA 3
## 6 bin.698 NA NA 3
## 7 bin.716 NA 3 3
## 8 bin.982 NA NA 3
### 2 timepoints ###
# 4 bins have all 3 groups with 4+ observations at both d7 and d35
all_groups %>%
select(genome, day, num_w_4p) %>%
spread(key=day, value = num_w_4p) %>%
mutate(d7vd35 = `07` + `35`) %>%
filter(!is.na(d7vd35))## # A tibble: 4 x 5
## genome `07` `35` `78` d7vd35
## <chr> <int> <int> <int> <int>
## 1 bin.235 3 3 NA 6
## 2 bin.280 3 3 NA 6
## 3 bin.316 3 3 NA 6
## 4 bin.493 3 3 NA 6
# 1 bin has all 3 groups with 4+ observations at both d35 and d78
all_groups %>%
select(genome, day, num_w_4p) %>%
spread(key=day, value = num_w_4p) %>%
mutate(d35vd78 = `35` + `78`) %>%
filter(!is.na(d35vd78))## # A tibble: 1 x 5
## genome `07` `35` `78` d35vd78
## <chr> <int> <int> <int> <int>
## 1 bin.716 NA 3 3 6
# no bin has all 3 groups with 4+ observations at d7 and d78
all_groups %>%
select(genome, day, num_w_4p) %>%
spread(key=day, value = num_w_4p) %>%
mutate(d07vd78 = `07` + `78`) %>%
filter(!is.na(d07vd78))## # A tibble: 0 x 5
## # … with 5 variables: genome <chr>, `07` <int>, `35` <int>, `78` <int>,
## # d07vd78 <int>
##### These sets of bins can be used to investigate time effect #####
##### within each treatment #####
# these ones can compare d7 vs d35 in ctrl
# 15 bins
ctrl735 <- number_of_obs %>%
# filter(num_w_4p >1) %>%
# filter(ctrl >3) %>%
select(-bin_day, -num_NA, -num_w_4p) %>%
gather(key = 'treat', value = 'count', -day, -genome) %>%
unite(col='treat_day', treat, day) %>% spread(key=treat_day, value=count) %>%
select(genome, starts_with('ctrl')) %>%
filter(ctrl_07 > 3 & ctrl_35 > 3)
ctrl735## # A tibble: 15 x 4
## genome ctrl_07 ctrl_35 ctrl_78
## <chr> <int> <int> <int>
## 1 bin.1044 9 10 NA
## 2 bin.127 4 5 NA
## 3 bin.231 6 5 NA
## 4 bin.235 8 9 2
## 5 bin.280 5 8 NA
## 6 bin.285 4 4 1
## 7 bin.316 7 9 NA
## 8 bin.322 7 8 NA
## 9 bin.459 5 7 1
## 10 bin.493 5 6 NA
## 11 bin.716 6 10 4
## 12 bin.735 6 9 2
## 13 bin.810 5 5 NA
## 14 bin.820 9 4 NA
## 15 bin.988 9 7 NA
# 8 bins
sub735 <- number_of_obs %>%
# filter(num_w_4p >1) %>%
# filter(ctrl >3) %>%
select(-bin_day, -num_NA, -num_w_4p) %>%
gather(key = 'treat', value = 'count', -day, -genome) %>%
unite(col='treat_day', treat, day) %>% spread(key=treat_day, value=count) %>%
select(genome, starts_with('sub')) %>%
filter(sub_07 > 3 & sub_35 > 3)
sub735## # A tibble: 8 x 4
## genome sub_07 sub_35 sub_78
## <chr> <int> <int> <int>
## 1 bin.205 6 4 NA
## 2 bin.235 10 4 2
## 3 bin.280 8 7 1
## 4 bin.316 5 8 4
## 5 bin.424 5 4 NA
## 6 bin.493 4 5 NA
## 7 bin.560 5 8 2
## 8 bin.735 4 7 2
# 11 bins
ther735 <- number_of_obs %>%
# filter(num_w_4p >1) %>%
# filter(ctrl >3) %>%
select(-bin_day, -num_NA, -num_w_4p) %>%
gather(key = 'treat', value = 'count', -day, -genome) %>%
unite(col='treat_day', treat, day) %>% spread(key=treat_day, value=count) %>%
select(genome, starts_with('ther')) %>%
filter(ther_07 > 3 & ther_35 > 3)
ther735## # A tibble: 11 x 4
## genome ther_07 ther_35 ther_78
## <chr> <int> <int> <int>
## 1 bin.1012 6 4 NA
## 2 bin.103 4 6 3
## 3 bin.205 6 5 NA
## 4 bin.231 5 6 NA
## 5 bin.235 8 5 1
## 6 bin.280 7 7 5
## 7 bin.316 7 10 9
## 8 bin.424 6 5 NA
## 9 bin.493 5 4 NA
## 10 bin.796 4 5 4
## 11 bin.886 9 7 1
# start to pair down the data to only include genomes that have enough data for comparisons?
##### only includes estimates from bins that have at least 1 treatment with 4+ observations in any one timepoint ##
dat %>% unite(col = 'bin_day', genome, day, remove = FALSE) %>%
filter(bin_day %in% one_group$bin_day) %>% nrow()## [1] 1572
# 1572 observations
dat %>% unite(col = 'bin_day', genome, day, remove = FALSE) %>%
filter(bin_day %in% one_group$bin_day) %>%
select(genome) %>% unlist() %>% unique() %>% length()## [1] 102
#102 genomes
# histogram
dat %>% unite(col = 'bin_day', genome, day, remove = FALSE) %>%
filter(bin_day %in% one_group$bin_day) %>%
ggplot(aes(x=iRep)) +
geom_histogram(bins=50)+
theme_bw() +
ggtitle('Histogram of iRep estimates',
'considering only those genomes that have \nat least 1 treatment with 4+ observations at any time')dat %>% unite(col = 'bin_day', genome, day, remove = FALSE) %>%
filter(bin_day %in% one_group$bin_day) %>%
group_by(genome) %>%
tally() %>%
ggplot(aes(x=n))+geom_histogram(bins=30) +
ggtitle('Valid iRep estimates per genome',
'considering only those genomes that have \nat least 1 treatment with 4+ observations at any time')# boxplots
dat %>% unite(col = 'bin_day', genome, day, remove = FALSE) %>%
filter(bin_day %in% one_group$bin_day) %>%
ggplot(aes(x=treatment, y=iRep)) + geom_boxplot() + facet_wrap(~day) +
ggtitle('iRep estimates by treatment',
'considering only those genomes that have \nat least 1 treatment with 4+ observations at any time')dat %>% unite(col = 'bin_day', genome, day, remove = FALSE) %>%
filter(bin_day %in% one_group$bin_day) %>%
ggplot(aes(x=day, y=iRep)) + geom_boxplot() + facet_wrap(~treatment) +
ggtitle('iRep estimates by time',
'considering only those genomes that have \nat least 1 treatment with 4+ observations at any time')### 2 with 4 plus ###
# only includes bins that have at least 2 treatments with 4+ observations
dat %>% unite(col = 'bin_day', genome, day, remove = FALSE) %>%
filter(bin_day %in% two_groups$bin_day) %>% nrow()## [1] 1062
# 1062 observations
dat %>% unite(col = 'bin_day', genome, day, remove = FALSE) %>%
filter(bin_day %in% two_groups$bin_day) %>%
select(genome) %>% unlist() %>% unique() %>% length()## [1] 56
#56 genomes
# histogram
dat %>% unite(col = 'bin_day', genome, day, remove = FALSE) %>%
filter(bin_day %in% two_groups$bin_day) %>%
ggplot(aes(x=iRep)) +
geom_histogram(bins=50)+
theme_bw() +
ggtitle('Histogram of iRep estimates',
'considering only those genomes that have \nat least 2 treatments with 4+ observations at any time')dat %>% unite(col = 'bin_day', genome, day, remove = FALSE) %>%
filter(bin_day %in% two_groups$bin_day) %>%
group_by(genome) %>%
tally() %>%
ggplot(aes(x=n))+geom_histogram(bins=30) +
ggtitle('Valid iRep estimates per genome',
'considering only those genomes that have \nat least 2 treatments with 4+ observations at any time')# boxplots
dat %>% unite(col = 'bin_day', genome, day, remove = FALSE) %>%
filter(bin_day %in% two_groups$bin_day) %>%
ggplot(aes(x=treatment, y=iRep)) + geom_boxplot() + facet_wrap(~day) +
ggtitle('iRep estimates by treatment',
'considering only those genomes that have \nat least 2 treatments with 4+ observations at any time')dat %>% unite(col = 'bin_day', genome, day, remove = FALSE) %>%
filter(bin_day %in% two_groups$bin_day) %>%
ggplot(aes(x=day, y=iRep)) + geom_boxplot() + facet_wrap(~treatment) +
ggtitle('iRep estimates by time',
'considering only those genomes that have \nat least 2 treatments with 4+ observations at any time')### 3 with 4 plus ###
# only includes bins with 4+ observations in all 3 treatments at any one timepoint
dat %>% unite(col = 'bin_day', genome, day, remove = FALSE) %>%
filter(bin_day %in% all_groups$bin_day) %>% nrow()## [1] 507
# 507 observations
dat %>% unite(col = 'bin_day', genome, day, remove = FALSE) %>%
filter(bin_day %in% all_groups$bin_day) %>%
select(genome) %>% unlist() %>% unique() %>% length()## [1] 22
#22 genomes
# histogram
dat %>% unite(col = 'bin_day', genome, day, remove = FALSE) %>%
filter(bin_day %in% all_groups$bin_day) %>%
ggplot(aes(x=iRep)) +
geom_histogram(bins=50)+
theme_bw() +
ggtitle('Histogram of iRep estimates',
'considering only those genomes that have \nat least 3 treatments with 4+ observations at any time')dat %>% unite(col = 'bin_day', genome, day, remove = FALSE) %>%
filter(bin_day %in% all_groups$bin_day) %>%
group_by(genome) %>%
tally() %>%
ggplot(aes(x=n))+geom_histogram(bins=30) +
ggtitle('Valid iRep estimates per genome',
'considering only those genomes that have \nat least 3 treatments with 4+ observations at any time')# boxplots
dat %>% unite(col = 'bin_day', genome, day, remove = FALSE) %>%
filter(bin_day %in% all_groups$bin_day) %>%
ggplot(aes(x=treatment, y=iRep)) + geom_boxplot() + facet_wrap(~day) +
ggtitle('iRep estimates by treatment',
'considering only those genomes that have \nat least 3 treatments with 4+ observations at any time')dat %>% unite(col = 'bin_day', genome, day, remove = FALSE) %>%
filter(bin_day %in% all_groups$bin_day) %>%
ggplot(aes(x=day, y=iRep)) + geom_boxplot() + facet_wrap(~treatment) +
ggtitle('iRep estimates by time',
'considering only those genomes that have \nat least 3 treatments with 4+ observations at any time')Well, the data aren’t idea, but we’ll try to get something meaningful out of it.
############ DAY 7 DIFFS BTWEEN GRUOPS ##########
# only includes genomes with 4+ observations in all 3 treatment groups at D7
# 8 genomes, 149 observations
dat %>% filter(genome %in% D7$genome & day %in% c('07'))%>%
ggplot(aes(x=treatment, y=iRep, fill=treatment)) + geom_violin() + geom_jitter(shape=21, width = .2)+
ggtitle('iRep growth rate estimates at D7', '') + theme_bw() +
stat_summary(fun.y = "mean", colour = "black", size = 2, geom = "point")dat %>% filter(genome %in% D7$genome & day %in% c('07'))%>%
ggplot(aes(x=genome, y=iRep, fill=treatment)) +
geom_boxplot() +
geom_jitter(shape=21, position=position_dodge2(width = .75)) +
ggtitle('iRep growth rate estimates at D7') + theme_bw()D7_treat_comp <- dat %>% filter(genome %in% D7$genome & day %in% c('07'))
D7_tukeys <- D7_treat_comp %>% group_by(genome) %>% nest() %>%
mutate(ANOVA=map(data, ~ aov(data=., iRep ~ treatment)),
tuk = map(ANOVA, TukeyHSD),
tid_tuk = map(tuk, tidy)) %>%
select(genome, tid_tuk) %>%
unnest(cols = tid_tuk) %>%
mutate(tuk_pval = adj.p.value,
fdr.pval = p.adjust(tuk_pval, method = 'fdr'))
sigs <- D7_tukeys %>%
select(-adj.p.value) %>% filter(tuk_pval < 0.05)
sigs## # A tibble: 6 x 8
## # Groups: genome [5]
## genome term comparison estimate conf.low conf.high tuk_pval fdr.pval
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 bin.235 treatme… ther-ctrl -0.186 -0.353 -0.0196 0.0265 0.0795
## 2 bin.886 treatme… ther-ctrl -0.272 -0.477 -0.0673 0.00837 0.0251
## 3 bin.280 treatme… ther-ctrl -0.278 -0.541 -0.0160 0.0367 0.110
## 4 bin.316 treatme… ther-sub -0.162 -0.307 -0.0165 0.0282 0.0845
## 5 bin.231 treatme… sub-ctrl -0.272 -0.459 -0.0843 0.00585 0.00877
## 6 bin.231 treatme… ther-ctrl -0.343 -0.519 -0.168 0.000589 0.00177
sigs_fdr <- sigs %>% filter(fdr.pval < 0.1)
# only those bins with a sig difference detected in tukey's test
dat %>% filter(genome %in% D7$genome & day %in% c('07'))%>%
filter(genome %in% sigs$genome) %>%
ggplot(aes(x=genome, y=iRep, fill=treatment)) +
geom_boxplot() +
geom_jitter(shape=21, position=position_dodge2(width = .75)) +
ggtitle('iRep growth rate estimates at D7', 'tukey pval < 0.05 -- not adjusted for multiple tukeys') + theme_bw()dat %>% filter(genome %in% D7$genome & day %in% c('07'))%>%
filter(genome %in% sigs_fdr$genome) %>%
ggplot(aes(x=genome, y=iRep, fill=treatment)) +
geom_boxplot() +
geom_jitter(shape=21, position=position_dodge2(width = .75)) +
ggtitle('iRep growth rate estimates at D7', 'fdr pval < 0.10') + theme_bw()D7_tukeys %>% #filter(comparison != 'ther-sub') %>%
ggplot(aes(x=genome, y=estimate)) +
geom_point() +
geom_errorbar(aes(ymin=conf.low, ymax=conf.high), width=.2) +
coord_flip() + geom_hline(yintercept = 0, color='red') +
ggtitle('95% confidence intervals for the difference in growth rate between treatments',
'negative estimates indicate higher growth rate in ctrl, D7 only')+ facet_wrap(~comparison)# mixed model? not super confident about this.
summary(lmer(data = D7_treat_comp, formula = iRep ~ treatment + (1|genome)))## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: iRep ~ treatment + (1 | genome)
## Data: D7_treat_comp
##
## REML criterion at convergence: -74.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4914 -0.5265 -0.0456 0.5822 3.7026
##
## Random effects:
## Groups Name Variance Std.Dev.
## genome (Intercept) 0.03276 0.1810
## Residual 0.02796 0.1672
## Number of obs: 149, groups: genome, 8
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 1.97145 0.06863 8.41736 28.725 1.07e-09 ***
## treatmentsub -0.06411 0.03429 139.22518 -1.870 0.0636 .
## treatmentther -0.20802 0.03394 139.06323 -6.129 8.59e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) trtmnts
## treatmentsb -0.260
## treatmntthr -0.262 0.530
checkm %>% filter(bin %in% sigs$genome) %>% select(bin, marker_lineage, Completeness, Contamination)## # A tibble: 5 x 4
## bin marker_lineage Completeness Contamination
## <chr> <chr> <dbl> <dbl>
## 1 bin.886 k__Bacteria_(UID2372) 97.2 0.94
## 2 bin.231 o__Clostridiales_(UID1212) 91.6 1.59
## 3 bin.280 o__Clostridiales_(UID1226) 89.0 0.63
## 4 bin.316 o__Clostridiales_(UID1212) 87.5 0.48
## 5 bin.235 f__Lachnospiraceae_(UID1256) 87.4 0.97
###### DAY 35 DIFF BTWEEN GROUPS #######
# only includes genomes with 4+ observations in all 3 treatment groups at D35
# 11 genomes, 211 observations
D35_treat_comp <- dat %>% filter(genome %in% D35$genome & day %in% c('35'))
D35_treat_comp %>% ggplot(aes(x=treatment, y=iRep, fill=treatment)) +
geom_violin() + geom_jitter(shape=21, width = .2) +
ggtitle('iRep growth rate estimates at D35') + theme_bw()dat %>% filter(genome %in% D35$genome & day %in% c('35'))%>%
ggplot(aes(x=genome, y=iRep, fill=treatment)) + geom_boxplot() +
geom_jitter(shape=21, position=position_dodge2(width = .75)) +
ggtitle('iRep growth rate estimates at D35, by genome') + theme_bw()D35_tukeys <- D35_treat_comp %>% group_by(genome) %>% nest() %>%
mutate(ANOVA=map(data, ~ aov(data=., iRep ~ treatment)),
tuk = map(ANOVA, TukeyHSD),
tid_tuk = map(tuk, tidy)) %>%
select(genome, tid_tuk) %>%
unnest(cols = tid_tuk) %>%
mutate(tuk_pval = adj.p.value,
fdr.pval = p.adjust(tuk_pval, method = 'fdr'))
sigs <- D35_tukeys %>%
select(-adj.p.value) %>% filter(tuk_pval < 0.05)
# no sigs by anova/tukey
D35_tukeys %>% #filter(comparison != 'ther-sub') %>%
ggplot(aes(x=genome, y=estimate)) +
geom_point() +
geom_errorbar(aes(ymin=conf.low, ymax=conf.high), width=.2) +
coord_flip() + geom_hline(yintercept = 0, color='red') +
ggtitle('95% confidence intervals for the difference in growth rate between treatments',
'negative estimates indicate higher growth rate in ctrl, D35 only')+ facet_wrap(~comparison)# mixed model? not super confident about this.
summary(lmer(data = D35_treat_comp, formula = iRep ~ treatment + (1|genome)))## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: iRep ~ treatment + (1 | genome)
## Data: D35_treat_comp
##
## REML criterion at convergence: -281.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.0364 -0.6657 -0.0746 0.5399 3.3851
##
## Random effects:
## Groups Name Variance Std.Dev.
## genome (Intercept) 0.03565 0.1888
## Residual 0.01171 0.1082
## Number of obs: 211, groups: genome, 11
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 1.693221 0.058269 10.616555 29.059 1.83e-11 ***
## treatmentsub -0.005495 0.018189 198.141196 -0.302 0.763
## treatmentther -0.002010 0.018208 198.108917 -0.110 0.912
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) trtmnts
## treatmentsb -0.144
## treatmntthr -0.143 0.464
# no treatment effect detectable at D35
###### Day 78 Diff BTWEEN GROUPS ######
# only includes genomes with 4+ observations in all 3 treatment groups at D35
# 8 genomes, 147 observations
dat %>% filter(genome %in% D78$genome & day %in% c('78'))%>%
ggplot(aes(x=genome, y=iRep, fill=treatment)) + geom_boxplot() + geom_jitter(shape=21, position=position_dodge2(width = .75))D78_treat_comp <- dat %>% filter(genome %in% D78$genome & day %in% c('78'))
D78_treat_comp %>% ggplot(aes(x=treatment, y=iRep, fill=treatment)) +
geom_violin() + geom_jitter(shape=21, width = .2) +
ggtitle('iRep growth rate estimates at D78') + theme_bw()D78_treat_comp %>%
ggplot(aes(x=genome, y=iRep, fill=treatment)) + geom_boxplot() +
geom_jitter(shape=21, position=position_dodge2(width = .75)) +
ggtitle('iRep growth rate estimates at D78, by genome') + theme_bw()D78_tukeys <- D78_treat_comp %>% group_by(genome) %>% nest() %>%
mutate(ANOVA=map(data, ~ aov(data=., iRep ~ treatment)),
tuk = map(ANOVA, TukeyHSD),
tid_tuk = map(tuk, tidy)) %>%
select(genome, tid_tuk) %>%
unnest(cols = tid_tuk) %>%
mutate(tuk_pval = adj.p.value,
fdr.pval = p.adjust(tuk_pval, method = 'fdr'))
sigs <- D78_tukeys %>%
select(-adj.p.value) %>% filter(tuk_pval < 0.05)
D78_tukeys %>% filter(comparison != 'ther-sub') %>%
ggplot(aes(x=genome, y=estimate)) +
geom_point() +
geom_errorbar(aes(ymin=conf.low, ymax=conf.high), width=.2) +
coord_flip() + geom_hline(yintercept = 0, color='red') +
ggtitle('95% confidence intervals for the difference in growth rate between treatments',
'negative estimates indicate higher growth rate in ctrl, D35 only')+ facet_wrap(~comparison)# mixed model? not super confident about this.
summary(lmer(data = D78_treat_comp, formula = iRep ~ treatment + (1|genome)))## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: iRep ~ treatment + (1 | genome)
## Data: D78_treat_comp
##
## REML criterion at convergence: -198.2
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.40673 -0.65882 -0.06031 0.68226 2.45716
##
## Random effects:
## Groups Name Variance Std.Dev.
## genome (Intercept) 0.01039 0.1019
## Residual 0.01189 0.1090
## Number of obs: 147, groups: genome, 8
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 1.64378 0.03964 9.03716 41.466 1.27e-11 ***
## treatmentsub -0.07263 0.02256 137.22989 -3.220 0.0016 **
## treatmentther -0.05829 0.02246 137.41201 -2.595 0.0105 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) trtmnts
## treatmentsb -0.302
## treatmntthr -0.305 0.528
# small treatment effect detected, both sub and ther, but at this timepoint they are receiving equal doses.
########### END TREATMENT EFFECT ############ CONTROL ONLY DIFFERENCES BETWEEN GROWTH RATES AT D7 AND D35
# only includes genomes with 4+ observations at both D7 and D35 in the control treatment only
# 15 genomes 201 observations
dat %>% filter(genome %in% ctrl735$genome & day %in% c('07', '35')& treatment == 'ctrl')%>%
ggplot(aes(x=day, y=iRep, fill=day)) +
geom_violin() +
ggtitle('iRep estimates at D7 and D35, control only', '15 genomes, 201 observations')# looks to be a trend to higher growth rates in d7 communities
dat %>% filter(genome %in% ctrl735$genome & day %in% c('07', '35') & treatment == 'ctrl')%>%
ggplot(aes(x=genome, y=iRep, fill=day)) + geom_boxplot() +
ggtitle('iRep estimates at D7 and D35, control only', '15 genomes, 201 observations')ctrl735_dat <- dat %>% filter(genome %in% ctrl735$genome & day %in% c('07', '35')& treatment == 'ctrl')
ctrl735_tests <- dat %>%
filter(genome %in% ctrl735$genome & day %in% c('07', '35')& treatment == 'ctrl') %>%
group_by(genome) %>%
nest()
# fit a simple linear model on each genome
# ctrl735_lms <- ctrl735_tests %>% mutate(lms=map(data, ~ lm(data=. , formula = iRep ~ day)),
# tid_sum = map(lms, tidy)) %>% select(genome, tid_sum) %>%
# unnest(cols = c('tid_sum'))#%>% filter(term == 'day35' & p.value < 0.05)
#
# ctrl735_lms %>%
# filter(term == 'day35' & p.value < 0.05)
# Of the 15 genomes, there is evidence that 5 of them have lower growth rates at d35 relative to D7
# ANOVA for each genome
ctrl735_tuk <- ctrl735_tests %>%
mutate(ANOVA=map(data, ~ aov(data=., iRep ~ day)),
tuk = map(ANOVA, TukeyHSD),
tid_tuk = map(tuk, tidy)) %>%
select(genome, tid_tuk) %>%
unnest(cols = tid_tuk) %>%
mutate(tuk_pval = adj.p.value,
fdr.pval = p.adjust(tuk_pval, method = 'fdr')) %>%
select(-adj.p.value) #%>% filter(tuk_pval < 0.05)
ctrl735_tuk %>%
filter(tuk_pval < 0.05)## # A tibble: 5 x 8
## # Groups: genome [5]
## genome term comparison estimate conf.low conf.high tuk_pval fdr.pval
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 bin.1044 day 35-07 -0.370 -0.538 -0.202 0.000234 0.000234
## 2 bin.235 day 35-07 -0.172 -0.295 -0.0493 0.00919 0.00919
## 3 bin.735 day 35-07 -0.203 -0.379 -0.0265 0.0273 0.0273
## 4 bin.316 day 35-07 -0.199 -0.324 -0.0747 0.00407 0.00407
## 5 bin.231 day 35-07 -0.170 -0.322 -0.0187 0.0317 0.0317
ctrl735_tuk %>%
ggplot(aes(x=genome, y=estimate)) +
geom_point() +
geom_errorbar(aes(ymin=conf.low, ymax=conf.high), width=.2) +
coord_flip() + geom_hline(yintercept = 0, color='red') +
ggtitle('95% confidence intervals for the difference in growth rate between D7 and D35',
'negative estimates indicate higher growth rate at D7, Control treatment only')## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: iRep ~ day + (1 | genome)
## Data: ctrl735_dat
##
## REML criterion at convergence: -12.2
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.2514 -0.4663 -0.1007 0.3758 8.2607
##
## Random effects:
## Groups Name Variance Std.Dev.
## genome (Intercept) 0.06485 0.2547
## Residual 0.04246 0.2061
## Number of obs: 201, groups: genome, 15
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 1.92960 0.06918 15.56667 27.891 1.03e-14 ***
## day35 -0.17779 0.02946 185.48691 -6.035 8.45e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## day35 -0.224
# Time effect detectable in the control group. Lower growth-rate at D35 compared to D7
### SUBTHER ONLY DIFFERENCES BETWEEN GROWTH RATES AT D7 AND D35
# only includes genomes with 4+ observations at both D7 and D35 in the control treatment only
# 8 genomes 94 observations
dat %>% filter(genome %in% sub735$genome & day %in% c('07', '35')& treatment == 'sub')%>%
ggplot(aes(x=genome, y=iRep, fill=day)) + geom_boxplot() +
ggtitle('iRep estimates at D7 and D35, sub only', '8 genomes, 94 observations')dat %>% filter(genome %in% sub735$genome & day %in% c('07', '35')& treatment == 'sub')%>%
ggplot(aes(x=day, y=iRep, fill=day)) + geom_violin() +
ggtitle('iRep estimates at D7 and D35, sub only', '8 genomes, 94 observations')# dat$treatment
sub735_dat <- dat %>% filter(genome %in% sub735$genome & day %in% c('07', '35')& treatment == 'sub')
sub735_tests <- dat %>%
filter(genome %in% sub735$genome & day %in% c('07', '35')& treatment == 'sub') %>%
group_by(genome) %>%
nest()
# ANOVA for each genome
sub735_tuk <- sub735_tests %>%
mutate(ANOVA=map(data, ~ aov(data=., iRep ~ day)),
tuk = map(ANOVA, TukeyHSD),
tid_tuk = map(tuk, tidy)) %>%
select(genome, tid_tuk) %>%
unnest(cols = tid_tuk) %>%
mutate(tuk_pval = adj.p.value,
fdr.pval = p.adjust(tuk_pval, method = 'fdr')) %>%
select(-adj.p.value) #%>% filter(tuk_pval < 0.05)
sub735_tuk %>%
filter(tuk_pval < 0.05)## # A tibble: 3 x 8
## # Groups: genome [3]
## genome term comparison estimate conf.low conf.high tuk_pval fdr.pval
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 bin.493 day 35-07 -0.234 -0.374 -0.0931 0.00567 0.00567
## 2 bin.735 day 35-07 -0.193 -0.304 -0.0833 0.00325 0.00325
## 3 bin.316 day 35-07 -0.220 -0.375 -0.0646 0.00979 0.00979
sub735_tuk %>%
ggplot(aes(x=genome, y=estimate)) +
geom_point() +
geom_errorbar(aes(ymin=conf.low, ymax=conf.high), width=.2) +
coord_flip() + geom_hline(yintercept = 0, color='red') +
ggtitle('95% confidence intervals for the difference in growth rate between D7 and D35',
'negative estimates indicate higher growth rate at D7, Control treatment only')## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: iRep ~ day + (1 | genome)
## Data: sub735_dat
##
## REML criterion at convergence: -80
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.63109 -0.58720 -0.02263 0.68275 2.95601
##
## Random effects:
## Groups Name Variance Std.Dev.
## genome (Intercept) 0.04323 0.2079
## Residual 0.01746 0.1321
## Number of obs: 94, groups: genome, 8
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 1.82949 0.07612 7.58090 24.033 1.97e-08 ***
## day35 -0.10652 0.02806 85.39590 -3.796 0.000274 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## day35 -0.185
# Time effect detectable in the sub group. Lower growth-rate at D35 compared to D7
### THER ONLY DIFFERENCES BETWEEN GROWTH RATES AT D7 AND D35
# only includes genomes with 4+ observations at both D7 and D35 in the control treatment only
# 11 genomes 131 observations
dat %>% filter(genome %in% ther735$genome & day %in% c('07', '35')& treatment == 'ther')%>%
ggplot(aes(x=genome, y=iRep, fill=day)) + geom_boxplot() +
ggtitle('iRep estimates at D7 and D35, ther only', '11 genomes, 131 observations')dat %>% filter(genome %in% ther735$genome & day %in% c('07', '35')& treatment == 'ther')%>%
ggplot(aes(x=day, y=iRep, fill=day)) + geom_violin() +
ggtitle('iRep estimates at D7 and D35, ther only', '11 genomes, 131 observations')# dat$treatment
ther735_dat <- dat %>% filter(genome %in% ther735$genome & day %in% c('07', '35')& treatment == 'ther')
ther735_tests <- dat %>%
filter(genome %in% ther735$genome & day %in% c('07', '35')& treatment == 'ther') %>%
group_by(genome) %>%
nest()
# ANOVA for each genome
ther735_tuk <- ther735_tests %>%
mutate(ANOVA=map(data, ~ aov(data=., iRep ~ day)),
tuk = map(ANOVA, TukeyHSD),
tid_tuk = map(tuk, tidy)) %>%
select(genome, tid_tuk) %>%
unnest(cols = tid_tuk) %>%
mutate(tuk_pval = adj.p.value,
fdr.pval = p.adjust(tuk_pval, method = 'fdr')) %>%
select(-adj.p.value) #%>% filter(tuk_pval < 0.05)
ther735_tuk %>%
filter(tuk_pval < 0.05)## # A tibble: 0 x 8
## # Groups: genome [0]
## # … with 8 variables: genome <chr>, term <chr>, comparison <chr>,
## # estimate <dbl>, conf.low <dbl>, conf.high <dbl>, tuk_pval <dbl>,
## # fdr.pval <dbl>
# Of the 11 genomes, there is evidence that 0 of them have lower growth rates at d35 relative to D7
ther735_tuk %>%
ggplot(aes(x=genome, y=estimate)) +
geom_point() +
geom_errorbar(aes(ymin=conf.low, ymax=conf.high), width=.2) +
coord_flip() + geom_hline(yintercept = 0, color='red') +
ggtitle('95% confidence intervals for the difference in growth rate between D7 and D35',
'negative estimates indicate higher growth rate at D7, Therapeutic treatment only')## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: iRep ~ day + (1 | genome)
## Data: ther735_dat
##
## REML criterion at convergence: -111.4
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.9844 -0.5342 -0.1127 0.4230 4.7538
##
## Random effects:
## Groups Name Variance Std.Dev.
## genome (Intercept) 0.03131 0.1769
## Residual 0.01828 0.1352
## Number of obs: 131, groups: genome, 11
##
## Fixed effects:
## Estimate Std. Error df t value Pr(>|t|)
## (Intercept) 1.67635 0.05592 10.98071 29.977 6.96e-12 ***
## day35 0.02964 0.02387 119.28428 1.242 0.217
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr)
## day35 -0.208